The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.
Credit card fraud happens when consumers give their credit card number to unfamiliar individuals, when cards are lost or stolen, when mail is diverted from the intended recipient and taken by criminals, or when employees of a business copy a cardholder's card or card number.
In recent years credit card usage has become predominant in modern society, and credit card fraud keeps growing. Financial losses due to fraud affect not only merchants and banks (e.g. reimbursements), but also individual clients. If the bank loses money, customers eventually pay as well through higher interest rates, higher membership fees, etc. Fraud may also affect the reputation and image of a merchant, causing non-financial losses that, though difficult to quantify in the short term, may become visible in the long run.
A Fraud Detection System (FDS) should not only detect fraud cases efficiently, but also be cost-effective in the sense that the cost invested in transaction screening should not exceed the loss due to fraud. The predictive model scores each transaction as high or low risk of fraud, and those with high risk generate alerts. Investigators check these alerts and provide feedback for each one, i.e. true positive (fraud) or false positive (genuine).
Banks handle huge volumes of transactions, among which very few are fraudulent, often less than 0.1%. Also, only a limited number of transactions can be checked by fraud investigators; we cannot ask a human to check every transaction one by one.
Alternatively, with Machine Learning (ML) techniques we can efficiently discover fraudulent patterns and predict transactions that are likely to be fraudulent. ML techniques infer a prediction model from a set of examples. The model is in most cases a parametric function that predicts the likelihood of a transaction being fraudulent, given a set of features describing the transaction.
Fraud detection is a binary classification task in which every transaction is predicted and labeled as fraudulent or legitimate. In this notebook, state-of-the-art classification techniques are tried for this task and their performances compared.
import pandas as pd
import numpy as np
from scipy import stats
import researchpy as rp
from datetime import datetime, timedelta
from collections import Counter
from numpy import where
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import matplotlib.gridspec as gridspec
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import SVMSMOTE
from imblearn.over_sampling import ADASYN
from lightgbm import LGBMClassifier
from sklearn import metrics
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix,auc,roc_curve
from sklearn.metrics import average_precision_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
data= pd.read_csv(r"C:\Users\mirza\OneDrive\Data Science\data\creditcard.csv")
data.head()
| Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
5 rows × 31 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
data.isnull().sum()
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
data.describe()
| Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 284807.000000 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | ... | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 284807.000000 | 284807.000000 |
| mean | 94813.859575 | 3.918649e-15 | 5.682686e-16 | -8.761736e-15 | 2.811118e-15 | -1.552103e-15 | 2.040130e-15 | -1.698953e-15 | -1.893285e-16 | -3.147640e-15 | ... | 1.473120e-16 | 8.042109e-16 | 5.282512e-16 | 4.456271e-15 | 1.426896e-15 | 1.701640e-15 | -3.662252e-16 | -1.217809e-16 | 88.349619 | 0.001727 |
| std | 47488.145955 | 1.958696e+00 | 1.651309e+00 | 1.516255e+00 | 1.415869e+00 | 1.380247e+00 | 1.332271e+00 | 1.237094e+00 | 1.194353e+00 | 1.098632e+00 | ... | 7.345240e-01 | 7.257016e-01 | 6.244603e-01 | 6.056471e-01 | 5.212781e-01 | 4.822270e-01 | 4.036325e-01 | 3.300833e-01 | 250.120109 | 0.041527 |
| min | 0.000000 | -5.640751e+01 | -7.271573e+01 | -4.832559e+01 | -5.683171e+00 | -1.137433e+02 | -2.616051e+01 | -4.355724e+01 | -7.321672e+01 | -1.343407e+01 | ... | -3.483038e+01 | -1.093314e+01 | -4.480774e+01 | -2.836627e+00 | -1.029540e+01 | -2.604551e+00 | -2.256568e+01 | -1.543008e+01 | 0.000000 | 0.000000 |
| 25% | 54201.500000 | -9.203734e-01 | -5.985499e-01 | -8.903648e-01 | -8.486401e-01 | -6.915971e-01 | -7.682956e-01 | -5.540759e-01 | -2.086297e-01 | -6.430976e-01 | ... | -2.283949e-01 | -5.423504e-01 | -1.618463e-01 | -3.545861e-01 | -3.171451e-01 | -3.269839e-01 | -7.083953e-02 | -5.295979e-02 | 5.600000 | 0.000000 |
| 50% | 84692.000000 | 1.810880e-02 | 6.548556e-02 | 1.798463e-01 | -1.984653e-02 | -5.433583e-02 | -2.741871e-01 | 4.010308e-02 | 2.235804e-02 | -5.142873e-02 | ... | -2.945017e-02 | 6.781943e-03 | -1.119293e-02 | 4.097606e-02 | 1.659350e-02 | -5.213911e-02 | 1.342146e-03 | 1.124383e-02 | 22.000000 | 0.000000 |
| 75% | 139320.500000 | 1.315642e+00 | 8.037239e-01 | 1.027196e+00 | 7.433413e-01 | 6.119264e-01 | 3.985649e-01 | 5.704361e-01 | 3.273459e-01 | 5.971390e-01 | ... | 1.863772e-01 | 5.285536e-01 | 1.476421e-01 | 4.395266e-01 | 3.507156e-01 | 2.409522e-01 | 9.104512e-02 | 7.827995e-02 | 77.165000 | 0.000000 |
| max | 172792.000000 | 2.454930e+00 | 2.205773e+01 | 9.382558e+00 | 1.687534e+01 | 3.480167e+01 | 7.330163e+01 | 1.205895e+02 | 2.000721e+01 | 1.559499e+01 | ... | 2.720284e+01 | 1.050309e+01 | 2.252841e+01 | 4.584549e+00 | 7.519589e+00 | 3.517346e+00 | 3.161220e+01 | 3.384781e+01 | 25691.160000 | 1.000000 |
8 rows × 31 columns
There are no null values. The dataset contains 284,807 transactions. The mean transaction amount is 88.35 USD, while the largest transaction recorded in this dataset amounts to 25,691 USD. However, as the gap between the mean and the maximum suggests, the distribution of transaction amounts is heavily right-skewed: the vast majority of transactions are relatively small, and only a tiny fraction comes anywhere close to the maximum. As mentioned above, little more can be said about the other variables, since they are PCA components released under privacy constraints.
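The skew can be quantified before and after a log transform; here is a minimal sketch using scipy's `skew` on synthetic log-normal values standing in for `Amount` (the distribution parameters are made up for illustration, since the CSV may not be at hand):

```python
import numpy as np
from scipy.stats import skew

# Synthetic stand-in for the heavily right-skewed Amount column
# (log-normal draws; the real values would come from data["Amount"]).
rng = np.random.default_rng(0)
amount = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

raw_skew = skew(amount)
log_skew = skew(np.log(amount + 0.001))  # same small shift the notebook uses to avoid log(0)

print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

The log transform pulls the long right tail in, which is exactly why the next cell applies it to `Amount`.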
data.Amount = np.log(data.Amount + 0.001)  # log-transform to tame the heavy right skew of Amount (small shift avoids log(0))
data["ClassNew"]=data["Class"].apply(lambda x: "FRAUD" if x == 1 else "NOT FRAUD")
data["ClassNew"].value_counts()
NOT FRAUD    284315
FRAUD           492
Name: ClassNew, dtype: int64
print("NOT FRAUD %",
(data["ClassNew"].value_counts()[0]/data["ClassNew"].value_counts().sum())*100)
print("FRAUD %",
(data["ClassNew"].value_counts()[1]/data["ClassNew"].value_counts().sum())*100)
print(50*"-")
print("NOT FRAUD AMOUNT %",
(data["Amount"][data["ClassNew"]=="NOT FRAUD"].sum()/data["Amount"].sum())*100)
print("FRAUD % AMOUNT",
(data["Amount"][data["ClassNew"]=="FRAUD"].sum()/data["Amount"].sum())*100)
NOT FRAUD % 99.82725143693798
FRAUD % 0.1727485630620034
--------------------------------------------------
NOT FRAUD AMOUNT % 99.87508652641638
FRAUD % AMOUNT 0.12491347358363826
data.head()
| Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | ClassNew | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 5.008105 | 0 | NOT FRAUD |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 0.989913 | 0 | NOT FRAUD |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 5.936641 | 0 | NOT FRAUD |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 4.816249 | 0 | NOT FRAUD |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 4.248367 | 0 | NOT FRAUD |
5 rows × 32 columns
plt.figure(figsize=(30,100))
unclas=data.iloc[:, 0:30]
sns.set(style="darkgrid")
for i, col in enumerate(unclas):
plt.subplot(16,2, i+1)
sns.distplot(data[col][data["ClassNew"]=="FRAUD"],
kde=True,
bins=50,
color="r")
sns.distplot(data[col][data["ClassNew"]=="NOT FRAUD"],
kde=True,
bins=50,
color="b")
plt.legend(labels=["FRAUD",
"NOT FRAUD"])
plt.xlabel("")
plt.ylabel("Density",
fontsize=19)
plt.title("Dispersion of {}".format(str(col)),
fontsize=26)
plt.tick_params(labelsize=22)
plt.tight_layout()
plt.figure(figsize=(30,50))
unclass=data.iloc[:, [3,4,5,6,11,12,13,14,16,18]]
sns.set(style="darkgrid")
for i, col in enumerate(unclass):
plt.subplot(5,2, i+1)
sns.boxplot(x=data[col],
y="ClassNew",
data=data)
plt.ylabel("Transaction Type",
fontsize=22)
plt.xlabel("")
plt.tick_params(labelsize=22)
plt.title("Dispersion of {}".format(str(col)),
fontsize=26)
plt.tight_layout()
Let's take a closer look at the variables we selected. First, Not Fraud transactions show many more outliers than Fraud transactions. This is interesting, but there are only 492 fraud transactions, so it may simply be a sample-size effect.
time_class = pd.to_timedelta(data['Time'],
unit='s')
data['Time_min'] = (time_class.dt.components.minutes).astype(int)
data['Time_hour'] = (time_class.dt.components.hours).astype(int)
plt.figure(figsize=(12,5))
sns.set(style="darkgrid")
sns.distplot(data[data['ClassNew'] == "NOT FRAUD"]["Time_hour"],
color='r')
sns.distplot(data[data['ClassNew'] == "FRAUD"]["Time_hour"],
color='b')
plt.title('FRAUD x NOT FRAUD Transactions by Hours',
fontsize=17)
# legend labels must follow plotting order: NOT FRAUD (red) was plotted first
plt.legend(labels=["NOT FRAUD",
"FRAUD"])
plt.xlim([-1,25])
plt.show()
plt.figure(figsize=(15,5))
sns.set(style="darkgrid")
sns.distplot(data[data['ClassNew'] == "NOT FRAUD"]["Time_min"],
color='r')
sns.distplot(data[data['ClassNew'] == "FRAUD"]["Time_min"],
color='b')
plt.title('FRAUD x NOT FRAUD Transactions by Minutes',
fontsize=17)
# legend labels must follow plotting order: NOT FRAUD (red) was plotted first
plt.legend(labels=["NOT FRAUD",
"FRAUD"])
plt.xlim([-1,61])
plt.show()
plt.figure(figsize=(26,26))
corr= data.corr()
sns.heatmap(data=corr,
annot=True,
cbar=False,
square=True,
fmt=".2%")
plt.tight_layout()
There is no strong correlation between the variables, which is expected: most features are already principal components from PCA and are therefore (near) uncorrelated.
Outlier detection is a complex topic. The trade-off between reducing the number of transactions (and thus the volume of information available to the algorithms) and letting extreme outliers skew the predictions is not easily solvable and depends heavily on the data and goals. In this case, I decided to focus exclusively on ML methods and will not pursue outlier removal.
Standard ML techniques such as Decision Trees and Logistic Regression have a bias towards the majority class and tend to ignore the minority class. They tend to predict only the majority class, hence badly misclassifying the minority class. In more technical terms, with an imbalanced class distribution in the dataset, the model becomes prone to giving the minority class negligible or very low recall.
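This bias is easy to demonstrate: on labels with this dataset's class ratio, a baseline that always predicts the majority class already scores ~99.8% accuracy while catching zero frauds. A minimal sketch with scikit-learn's `DummyClassifier` on synthetic labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels mimicking the dataset's imbalance: 492 frauds in 284,807 rows.
y = np.zeros(284_807, dtype=int)
y[:492] = 1
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(f"accuracy: {accuracy_score(y, pred):.5f}")    # 0.99827
print(f"fraud recall: {recall_score(y, pred):.1f}")  # 0.0
```

This is why accuracy alone is not a trustworthy metric here, and why the resampling methods below are worth the effort.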
- There are mainly two families of techniques widely used for handling imbalanced class distributions.
SMOTE (Synthetic Minority Oversampling Technique) – Oversampling
SMOTE is one of the most commonly used oversampling methods to solve the imbalance problem. It generates the virtual training records by linear interpolation for the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors for each example in the minority class. After the oversampling process, the data is reconstructed and several classification models can be applied for the processed data.
NearMiss Algorithm – Undersampling
NearMiss is an under-sampling technique. It balances the class distribution by eliminating majority-class examples based on their distance to minority-class examples: when instances of the two classes are very close to each other, majority-class instances are removed to widen the gap between the classes. This helps the classification process. To limit the information loss inherent in most under-sampling techniques, near-neighbor methods are widely used.
def split():
X=data.drop(columns=["Class", "ClassNew"])
y=data["Class"].values
return X, y
def SMOTE():
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from numpy import where
X, y = split()
counter = Counter(y)
print(counter)
smt = SMOTE(random_state=0)
X, y = smt.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
counter = Counter(y)
print(counter)
return X_train, X_test, y_train, y_test
def BSMOTE():
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import BorderlineSMOTE
from numpy import where
X, y = split()
counter = Counter(y)
print(counter)
bsmote= BorderlineSMOTE(random_state=0)
X, y = bsmote.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
counter = Counter(y)
print(counter)
return X_train, X_test, y_train, y_test
def SMOTESVM():
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SVMSMOTE
from numpy import where
X, y = split()
counter = Counter(y)
print(counter)
smotesvm= SVMSMOTE(random_state=0)
X, y = smotesvm.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
counter = Counter(y)
print(counter)
return X_train, X_test, y_train, y_test
def ADASYN():
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN
from numpy import where
X, y = split()
counter = Counter(y)
print(counter)
adasyn= ADASYN(random_state=0)
X, y = adasyn.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
counter = Counter(y)
print(counter)
return X_train, X_test, y_train, y_test
Borderline-SMOTE: a popular extension to SMOTE that selects those instances of the minority class that are misclassified, such as by a k-nearest-neighbor classification model. We can then oversample just those difficult instances, providing more resolution only where it may be required.
The SVM-based variant is summarized in the 2009 paper titled “Borderline Over-sampling For Imbalanced Data Classification.” An SVM is used to locate the decision boundary defined by the support vectors, and examples in the minority class that are close to the support vectors become the focus for generating synthetic examples.
ADASYN takes another approach: it generates synthetic samples inversely proportional to the density of the minority examples. That is, it generates more synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.
I defined X and y in the split() function, then created a function for each SMOTE variant and for ADASYN. This way we can easily apply all of them to the chosen classification algorithms and see which one works best.
%time X_train1,X_test1, y_train1, y_test1 = SMOTE()
print(50*"-")
%time X_train2, X_test2, y_train2, y_test2 = BSMOTE()
print(50*"-")
%time X_train3, X_test3, y_train3, y_test3 = SMOTESVM()
print(50*"-")
%time X_train4, X_test4, y_train4, y_test4 = ADASYN()
Counter({0: 284315, 1: 492})
Counter({0: 284315, 1: 284315})
Wall time: 648 ms
--------------------------------------------------
Counter({0: 284315, 1: 492})
Counter({0: 284315, 1: 284315})
Wall time: 2.35 s
--------------------------------------------------
Counter({0: 284315, 1: 492})
Counter({0: 284315, 1: 284315})
Wall time: 13 s
--------------------------------------------------
Counter({0: 284315, 1: 492})
Counter({1: 284448, 0: 284315})
Wall time: 2.41 s
Here we see the class counts and running times of each resampling function, which makes it easier to weigh them against each other.
A confusion matrix is a technique for summarizing the performance of a classification algorithm.
Classification accuracy alone can be misleading if you have an unequal number of observations in each class or if you have more than two classes in your dataset.
Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.
True Positives (TP): These are cases in which we predicted yes (e.g. the person is pregnant), and they are pregnant.
True Negatives (TN): We predicted no, and they are not pregnant.
False Positives (FP): We predicted yes, but they are not actually pregnant. (Also known as a "Type I error.")
False Negatives (FN): We predicted no, but they actually are pregnant. (Also known as a "Type II error.")
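From these four counts the standard metrics follow directly; a small worked sketch with made-up counts:

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, TN, FP, FN = 80, 900, 15, 5

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)  # of the alerts raised, how many were real
recall    = TP / (TP + FN)  # of the real positives, how many were caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For fraud detection, recall on the minority class matters far more than overall accuracy, which the imbalance inflates.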
def Models(models, X_train, X_test, y_train, y_test, title):
# Fit on the (resampled) training data, then show confusion matrices for
# the training set, the test set and the full original data set.
model = models
model.fit(X_train,y_train)
X, y = split()
train_matrix = pd.crosstab(y_train,
model.predict(X_train),
rownames=['Actual'],
colnames=['Predicted'])
test_matrix = pd.crosstab(y_test,
model.predict(X_test),
rownames=['Actual'],
colnames=['Predicted'])
matrix = pd.crosstab(y,
model.predict(X),
rownames=['Actual'],
colnames=['Predicted'])
f,(ax1,ax2,ax3) = plt.subplots(1,3,
sharey=True,
figsize=(25, 10))
m1 = sns.heatmap(train_matrix,
annot=True,
fmt=".1f",
cbar=False,
annot_kws={"size": 16},ax=ax1)
m1.set_title(title, fontsize=25)
m1.set_ylabel('Total Fraud = {}'.format(y_train.sum()),
fontsize=19,
rotation=90)
m1.set_xlabel('Accuracy score for Training Set: %{}'.format(accuracy_score(model.predict(X_train),
y_train)*100),
fontsize=19)
m2 = sns.heatmap(test_matrix,
annot=True,
fmt=".1f",
cbar=False,
annot_kws={"size": 16},ax=ax2)
m2.set_ylabel('Total Fraud = {}'.format(y_test.sum()),
fontsize=19,
rotation=90)
m2.set_xlabel('Accuracy score for Testing Set: %{}'.format(accuracy_score(model.predict(X_test),
y_test)*100),
fontsize=19)
m3 = sns.heatmap(matrix,
annot=True,
fmt=".1f",
cbar=False,
annot_kws={"size": 16},ax=ax3)
m3.set_ylabel('Total Fraud = {}'.format(y.sum()),
fontsize=19,
rotation=90)
m3.set_xlabel('Accuracy score for Total Set: %{}'.format(accuracy_score(model.predict(X),
y)*100),
fontsize=19)
plt.show()
return y, model.predict(X)
Here I created a function to calculate and visualize the confusion matrices on the training, testing and full sets for each model/resampling combination, which makes it easier to tell the models apart.
title = 'Logistic Regression/SMOTE'
%time Models(LogisticRegression(),X_train1, X_test1, y_train1, y_test1, title)
title = 'Logistic Regression/BSMOTE'
%time Models(LogisticRegression(),X_train2, X_test2, y_train2, y_test2, title)
title = 'Logistic Regression/SMOTESVM'
%time y,ypred5= Models(LogisticRegression(),X_train3, X_test3, y_train3, y_test3, title)
title = 'Logistic Regression/ADASYN'
%time Models(LogisticRegression(),X_train4, X_test4, y_train4, y_test4, title)
Wall time: 2.68 s
Wall time: 2.72 s
Wall time: 2.5 s
Wall time: 1.79 s
(array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64))
No, it's not good.
title = 'Gaussian NB/SMOTE'
%time Models(GaussianNB(),X_train1, X_test1, y_train1, y_test1, title)
title = 'Gaussian NB/BSMOTE'
%time Models(GaussianNB(),X_train2, X_test2, y_train2, y_test2, title)
title = 'Gaussian NB/SMOTESVM'
%time y, ypred6= Models(GaussianNB(),X_train3, X_test3, y_train3, y_test3, title)
title = 'Gaussian NB/ADASYN'
%time Models(GaussianNB(),X_train4, X_test4, y_train4, y_test4, title)
Wall time: 1.52 s
Wall time: 1.45 s
Wall time: 1.47 s
Wall time: 1.54 s
(array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64))
No
title = 'DecisionTree Classifier/SMOTE'
%time Models(DecisionTreeClassifier(max_depth=14),X_train1, X_test1, y_train1, y_test1, title)
title = 'DecisionTree Classifier/BSMOTE'
%time y,ypred7= Models(DecisionTreeClassifier(max_depth=14),X_train2, X_test2, y_train2, y_test2, title)
title = 'DecisionTree Classifier/SMOTESVM'
%time Models(DecisionTreeClassifier(max_depth=14),X_train3, X_test3, y_train3, y_test3, title)
title = 'DecisionTree Classifier/ADASYN'
%time Models(DecisionTreeClassifier(max_depth=14),X_train4, X_test4, y_train4, y_test4, title)
Wall time: 11.4 s
Wall time: 11.6 s
Wall time: 11.4 s
Wall time: 11 s
(array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64))
Definitely no
title = 'Random Forest Classifier/SMOTE'
%time Models(RandomForestClassifier(random_state=0),X_train1, X_test1, y_train1, y_test1, title)
title = 'Random Forest Classifier/BSMOTE'
%time Models(RandomForestClassifier(random_state=0),X_train2, X_test2, y_train2, y_test2, title)
title = 'Random Forest Classifier/SMOTESVM'
%time Models(RandomForestClassifier(random_state=0),X_train3, X_test3, y_train3, y_test3, title)
title = 'Random Forest Classifier/ADASYN'
%time Models(RandomForestClassifier(random_state=0),X_train4, X_test4, y_train4, y_test4, title)
Wall time: 2min 14s
Wall time: 1min 59s
Wall time: 1min 56s
Wall time: 2min 20s
(array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64))
I think we have found candidates: 'Random Forest Classifier/SMOTE' and 'Random Forest Classifier/ADASYN' look promising.
title = 'GradientBoosting Classifier/SMOTE'
%time Models(GradientBoostingClassifier(n_estimators=500,learning_rate=1,max_features=2,max_depth=2,random_state=0),X_train1, X_test1, y_train1, y_test1, title)
title = 'GradientBoosting Classifier/BSMOTE'
%time Models(GradientBoostingClassifier(n_estimators=500,learning_rate=1,max_features=2,max_depth=2,random_state=0),X_train2, X_test2, y_train2, y_test2, title)
title = 'GradientBoosting Classifier/SMOTESVM'
%time Models(GradientBoostingClassifier(n_estimators=500,learning_rate=1,max_features=2,max_depth=2,random_state=0),X_train3, X_test3, y_train3, y_test3, title)
title = 'GradientBoosting Classifier/ADASYN'
%time y, ypred8= Models(GradientBoostingClassifier(n_estimators=500,learning_rate=1,max_features=2,max_depth=2,random_state=0),X_train4, X_test4, y_train4, y_test4, title)
Wall time: 1min 16s
Wall time: 1min 16s
Wall time: 1min 17s
Wall time: 1min 14s
Oh, no
title = 'XGB Classifier/SMOTE'
%time Models(XGBClassifier(),X_train1, X_test1, y_train1, y_test1, title)
title = 'XGB Classifier/BSMOTE'
%time Models(XGBClassifier(),X_train2, X_test2, y_train2, y_test2, title)
title = 'XGB Classifier/SMOTESVM'
%time Models(XGBClassifier(),X_train3, X_test3, y_train3, y_test3, title)
title = 'XGB Classifier/ADASYN'
%time Models(XGBClassifier(),X_train4, X_test4, y_train4, y_test4, title)
[16:53:38] WARNING: ..\src\learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Wall time: 20.5 s
Wall time: 16.6 s
Wall time: 15.9 s
Wall time: 20.9 s
(array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64))
'XGB Classifier/ADASYN' and 'XGB Classifier/SMOTE' look acceptable.
title = 'LGBM Classifier/SMOTE'
%time y,ypred9= Models(LGBMClassifier(),X_train1, X_test1, y_train1, y_test1, title)
title = 'LGBM Classifier/BSMOTE'
%time Models(LGBMClassifier(),X_train2, X_test2, y_train2, y_test2, title)
title = 'LGBM Classifier/SMOTESVM'
%time Models(LGBMClassifier(),X_train3, X_test3, y_train3, y_test3, title)
title = 'LGBM Classifier/ADASYN'
%time Models(LGBMClassifier(),X_train4, X_test4, y_train4, y_test4, title)
Wall time: 2.89 s
Wall time: 2.43 s
Wall time: 2.4 s
Wall time: 2.8 s
(array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64))
I think not.
title = 'Linear Discriminant Analysis/SMOTE'
%time Models(LinearDiscriminantAnalysis(),X_train1, X_test1, y_train1, y_test1, title)
title = 'Linear Discriminant Analysis/BSMOTE'
%time Models(LinearDiscriminantAnalysis(),X_train2, X_test2, y_train2, y_test2, title)
title = 'Linear Discriminant Analysis/SMOTESVM'
%time y,ypred10= Models(LinearDiscriminantAnalysis(),X_train3, X_test3, y_train3, y_test3, title)
title = 'Linear Discriminant Analysis/ADASYN'
%time Models(LinearDiscriminantAnalysis(),X_train4, X_test4, y_train4, y_test4, title)
Wall time: 1.45 s
Wall time: 1.47 s
Wall time: 1.44 s
Wall time: 1.47 s
(array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64))
No.
def Models1(models, X_train, X_test, y_train, y_test, title):
# Like Models, but adds ROC and precision-recall curves plus a
# classification report, rendered with Plotly.
model = models
model.fit(X_train,y_train)
X, y = split()
train_matrix = pd.crosstab(y_train,
model.predict(X_train),
rownames=['Actual'],
colnames=['Predicted'])
test_matrix = pd.crosstab(y_test,
model.predict(X_test),
rownames=['Actual'],
colnames=['Predicted'])
matrix = pd.crosstab(y,
model.predict(X),
rownames=['Actual'],
colnames=['Predicted'])
ypred= model.predict(X)
# ROC and precision-recall curves need continuous scores, not hard 0/1 labels
yscore = model.predict_proba(X)[:, 1] if hasattr(model, "predict_proba") else model.decision_function(X)
fpr, tpr, thresholds = roc_curve(y, yscore)
roc_auc = auc(fpr, tpr)
precision, recall, thresholds= precision_recall_curve(y, yscore)
auc_score = auc(recall, precision)
df = pd.DataFrame(classification_report(y, ypred, digits=2,output_dict=True)).T
df.reset_index(inplace=True)
df.rename(columns={"index": "Classification Report"}, inplace=True)
fig1=ff.create_annotated_heatmap(z= train_matrix.values,
colorscale = 'OrRd',
hoverinfo='z')
fig2=ff.create_annotated_heatmap(test_matrix.values,
colorscale = 'OrRd',
hoverinfo='z')
fig3=ff.create_annotated_heatmap(matrix.values,
colorscale = 'OrRd',
hoverinfo='z')
fig4= go.Figure(data=[go.Scatter(x=fpr,
y=tpr,
mode='lines+markers',
line=dict(color='black'),
showlegend=False)])
fig5= go.Figure(data=[go.Scatter(x=[0,1], y=[0,1],
mode='lines+markers',
line = dict(dash='dash',
color='red'),
showlegend=False)])
fig6= go.Figure(data=[go.Scatter(x=recall,
y=precision,
mode='lines+markers', line=dict(color='black'),
showlegend=False)])
fig7= go.Figure(data=[go.Scatter(x=[0,1],
y=[1,0],
mode='lines+markers', line = dict(dash='dash',
color='red'),
showlegend=False)])
fig8= go.Figure(data=[go.Scatter( x=thresholds,
y=precision[1:],
mode='lines+markers',
line = dict( color='red'),
name="Precision")])
fig9= go.Figure(data=[go.Scatter( x=thresholds,
y= recall[1:],
mode='lines+markers',
line = dict(color='black'),
name="Recall")])
fig10 = go.Figure(data=[go.Table(header=dict(values=list(df.columns),
font=dict(color='black',
size=14),
fill_color='paleturquoise',
align='left'),
cells=dict(values=[df["Classification Report"], df.precision, df.recall, df["f1-score"], df.support],
fill_color='lavender',
align='left'))])
fig = make_subplots(rows=3,cols=3,
specs=[[{},{},{}],
[{"colspan":1}, {}, {}],
[{"type": "table", "colspan":3}, {}, {}]],
subplot_titles=('Train Matrix <br> %{}'.format(accuracy_score(model.predict(X_train),
y_train)*100),
'Test Matrix <br> %{}'.format(accuracy_score(model.predict(X_test),
y_test)*100),
"Matrix <br> %{}".format(accuracy_score(model.predict(X),y)*100),
f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
'Precision-Recall Curve',
"Precision and Recall for Diffrent Threshold Values",
"Classification Report"))
fig.add_trace(fig1.data[0], 1, 1)
fig.add_trace(fig2.data[0], 1, 2)
fig.add_trace(fig3.data[0], 1, 3)
fig.add_trace(fig4.data[0], 2, 1)
fig.add_trace(fig5.data[0], 2, 1)
fig.add_trace(fig6.data[0], 2, 2)
fig.add_trace(fig7.data[0], 2, 2)
fig.add_trace(fig8.data[0], 2, 3)
fig.add_trace(fig9.data[0], 2, 3)
fig.add_trace(fig10.data[0], 3, 1)
annot1 = list(fig1.layout.annotations)
annot2 = list(fig2.layout.annotations)
annot3 = list(fig3.layout.annotations)
for k in range(len(annot1)):
annot1[k]['xref'] = 'x'
annot1[k]['yref'] = 'y'
for k in range(len(annot2)):
annot2[k]['xref'] = 'x2'
annot2[k]['yref'] = 'y2'
for k in range(len(annot3)):
annot3[k]['xref'] = 'x3'
annot3[k]['yref'] = 'y3'
new_annotations = annot1+annot2+annot3
for anno in new_annotations:
fig.add_annotation(anno)
fig['layout']['yaxis1'].update(title_text='ACTUAL')
fig['layout']['xaxis1'].update(title_text="PREDICTED")
fig['layout']['yaxis2'].update(title_text='ACTUAL')
fig['layout']['xaxis2'].update(title_text="PREDICTED")
fig['layout']['yaxis3'].update(title_text='ACTUAL')
fig['layout']['xaxis3'].update(title_text="PREDICTED")
fig['layout']['xaxis4'].update(title_text="False Positive Rate")
fig['layout']['yaxis4'].update(title_text="True Positive Rate")
fig['layout']['xaxis5'].update(title_text="Recall")
fig['layout']['yaxis5'].update(title_text="Precision")
fig['layout']['yaxis6'].update(title_text="Precision / Recall")
fig['layout']['xaxis6'].update(title_text="Threshold")
fig.add_annotation(dict(x=0.495,
y=0.005,
xref="paper",
yref="paper",
text='NOT FRAUD= 0 / FRAUD= 1',
font_size = 20,
showarrow=False))
fig.update_layout(height=1100,
width=1500,
font_family = 'TimesNewRoman',
font_color= "black",
font_size = 15,
title=title,
title_x = 0.5,
legend=dict(
x=0.93,
y=0.5,
traceorder='normal',
font=dict(size=12)))
fig.show()
return y, model.predict(X)
Comparing the models above, I chose four algorithms.
I then built a new function that presents the ROC curve, the precision-recall curve, precision and recall across different threshold values, and the classification report table, with clearer visualizations.
ROC Curve & Precision Recall Curve
ROC curves summarize the trade-off between the true positive rate and the false positive rate of a predictive model across different probability thresholds.
Precision-recall curves summarize the trade-off between the true positive rate (recall) and the positive predictive value (precision) across different probability thresholds.
ROC curves are appropriate when the classes are roughly balanced, whereas precision-recall curves are more informative for imbalanced datasets.
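To make the last point concrete, here is a minimal sketch on hypothetical, heavily imbalanced toy data (not the notebook's dataset): the same scores can yield a high ROC AUC while the precision-recall AUC stays much lower, because PR AUC is sensitive to the rarity of the positive class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

rng = np.random.default_rng(0)
# ~1% positives, mimicking a fraud-like class ratio.
y_true = np.r_[np.zeros(990, dtype=int), np.ones(10, dtype=int)]
# Imperfect scores: positives tend higher, with overlap.
scores = np.r_[rng.normal(0.2, 0.1, 990), rng.normal(0.5, 0.15, 10)]

roc = roc_auc_score(y_true, scores)
prec, rec, _ = precision_recall_curve(y_true, scores)
pr = auc(rec, prec)
print(f"ROC AUC: {roc:.3f}  PR AUC: {pr:.3f}")  # PR AUC comes out much lower
```

The gap between the two numbers is exactly why the PR curve is the more honest summary for this fraud dataset.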
Classification Report
The classification report is one of the standard performance summaries for a classification model. It displays the model's precision, recall, F1 score, and support per class, giving a fuller picture of performance than accuracy alone. The metrics it contains are explained below.
Precision: Precision is defined as the ratio of true positives to the sum of true and false positives.
Recall: Recall is defined as the ratio of true positives to the sum of true positives and false negatives.
F1 Score: The F1 score is the harmonic mean of precision and recall. The closer the F1 score is to 1.0, the better the expected performance of the model.
Support: Support is the number of actual occurrences of each class in the dataset. It does not vary between models; it simply puts the other metrics in context.
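The definitions above can be checked by hand on a tiny hypothetical label set (the values below are illustrative, not taken from this notebook's results):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

# Counting by hand for the positive class: TP=3, FP=1, FN=1.
tp, fp, fn = 3, 1, 1
assert precision_score(y_true, y_pred) == tp / (tp + fp)  # 3/4 = 0.75
assert recall_score(y_true, y_pred) == tp / (tp + fn)     # 3/4 = 0.75
# F1 is the harmonic mean of precision and recall.
print(f1_score(y_true, y_pred))  # 0.75
```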
title = 'Random Forest Classifier/SMOTE'
y, ypred1=Models1(RandomForestClassifier(random_state=0),X_train1, X_test1, y_train1, y_test1, title)
title = 'Random Forest Classifier/ADASYN'
y, ypred2=Models1(RandomForestClassifier(random_state=0),X_train4, X_test4, y_train4, y_test4, title)
title = 'XGB Classifier/SMOTE'
y, ypred3=Models1(XGBClassifier(random_state=0),X_train1, X_test1, y_train1, y_test1, title)
title = 'XGB Classifier/ADASYN'
y, ypred4=Models1(XGBClassifier(random_state=0),X_train4, X_test4, y_train4, y_test4, title)
[16:59:40] WARNING: ..\src\learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[17:00:01] WARNING: ..\src\learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
fpr1, tpr1, thresh1 = roc_curve(y, ypred1, pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y, ypred2, pos_label=1)
fpr3, tpr3, thresh3 = roc_curve(y, ypred3, pos_label=1)
fpr4, tpr4, thresh4 = roc_curve(y, ypred4, pos_label=1)
fpr5, tpr5, thresh5 = roc_curve(y, ypred5, pos_label=1)
fpr6, tpr6, thresh6 = roc_curve(y, ypred6, pos_label=1)
fpr7, tpr7, thresh7 = roc_curve(y, ypred7, pos_label=1)
fpr8, tpr8, thresh8 = roc_curve(y, ypred8, pos_label=1)
fpr9, tpr9, thresh9 = roc_curve(y, ypred9, pos_label=1)
fpr10, tpr10, thresh10 = roc_curve(y, ypred10, pos_label=1)
random_probs = [0 for i in range(len(y))]
p_fpr, p_tpr, _ = roc_curve(y, random_probs, pos_label=1)
auc_score1 = roc_auc_score(y, ypred1)
auc_score2 = roc_auc_score(y, ypred2)
auc_score3 = roc_auc_score(y, ypred3)
auc_score4 = roc_auc_score(y, ypred4)
auc_score5 = roc_auc_score(y, ypred5)
auc_score6 = roc_auc_score(y, ypred6)
auc_score7 = roc_auc_score(y, ypred7)
auc_score8 = roc_auc_score(y, ypred8)
auc_score9 = roc_auc_score(y, ypred9)
auc_score10 = roc_auc_score(y, ypred10)
trace1=go.Scatter( x=fpr1,
y= tpr1,
mode='lines+markers',
line = dict(color='orange'),
name=f"Random Forest Classifier/SMOTE (AUC={auc(fpr1, tpr1):.4f})")
trace2=go.Scatter( x=fpr2,
y= tpr2,
mode='lines+markers',
line = dict(color='green'),
name=f"Random Forest Classifier/ADASYN(AUC={auc(fpr2, tpr2):.4f})")
trace3=go.Scatter( x=fpr3,
y= tpr3,
mode='lines+markers',
line = dict(color='red'),
name=f"XGB Classifier/SMOTE(AUC={auc(fpr3, tpr3):.4f})")
trace4=go.Scatter( x=fpr4,
y= tpr4,
mode='lines+markers',
line = dict(color='yellow'),
name=f"XGB Classifier/ADASYN(AUC={auc(fpr4, tpr4):.4f})")
trace5=go.Scatter( x=p_fpr,
y= p_tpr,
mode='lines+markers',
line = dict(dash='dash',
color='black'),
showlegend=False)
trace6=go.Scatter( x=fpr5,
y= tpr5,
mode='lines+markers',
line = dict( color='purple'),
name=f"Logistic Regression/SMOTESVM(AUC={auc(fpr5, tpr5):.4f})")
trace7=go.Scatter( x=fpr6,
y= tpr6,
mode='lines+markers',
line = dict( color='blue'),
name=f"Gaussian NB/SMOTESVM(AUC={auc(fpr6, tpr6):.4f})")
trace8=go.Scatter( x=fpr7,
y= tpr7,
mode='lines+markers',
line = dict( color='pink'),
name=f"DecisionTree Classifier/BSMOTE(AUC={auc(fpr7, tpr7):.4f})")
trace9=go.Scatter( x=fpr8,
y= tpr8,
mode='lines+markers',
line = dict(color='turquoise'),
name=f"GradientBoosting Classifier/ADASYN(AUC={auc(fpr8, tpr8):.4f})")
trace10=go.Scatter( x=fpr9,
y= tpr9,
mode='lines+markers',
line = dict(color='grey'),
name=f"LGBM Classifier/SMOTE(AUC={auc(fpr9, tpr9):.4f})")
trace11=go.Scatter( x=fpr10,
y= tpr10,
mode='lines+markers',
line = dict( color='black'),
name=f"Linear Discriminant Analysis/SMOTESVM(AUC={auc(fpr10, tpr10):.4f})")
fig = make_subplots(rows=1,
cols=1,
specs=[[{'type': 'scatter'}]])
fig.add_traces([trace1,trace2,
trace3,trace4,
trace5,trace6,
trace7,trace8,
trace9,trace10,
trace11],
rows=1,
cols=1)
fig.update_layout(height=600,
width=850,
title_text="ROC CURVE",
font_family = 'TimesNewRoman',
font_size = 15,
font_color= "black",
title_x = 0.5)
fig.show()
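One caveat about the curves above: they are computed from hard 0/1 predictions (`model.predict`), so each "curve" really has only three points. Passing probability scores (e.g. from `predict_proba`) would trace the full curve. A minimal illustration on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import roc_curve

y = np.array([0, 0, 0, 1, 1])
hard = np.array([0, 1, 0, 1, 1])              # hard class labels
proba = np.array([0.1, 0.6, 0.3, 0.8, 0.4])  # predicted fraud probabilities

fpr_h, tpr_h, _ = roc_curve(y, hard)   # only (0,0), one corner, and (1,1)
fpr_p, tpr_p, _ = roc_curve(y, proba)  # one point per useful threshold
print(len(fpr_h), len(fpr_p))          # hard labels give far fewer points
```

The ranking of the models above still holds at the default threshold, but probability scores would make the AUC comparison more faithful.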
I think the Random Forest Classifier/ADASYN method is the best performer: looking at the charts, its results are better than those of the other combinations. But keep in mind that I did not use feature selection methods, and since the data arrived already PCA-transformed, spotting outliers or noisy records is very hard. I did not want to risk incorrect predictions, and the data was already imbalanced.
There are 35 false positives in the Random Forest Classifier/ADASYN method. This means that per 284,772 transactions, 35 genuine ones are wrongly predicted as fraud. At first glance that looks good, but banks process millions of transactions every day, so they might unnecessarily lock hundreds of customer accounts, which would erode customer confidence in the bank.
This approach could be developed further with feature selection or cross-validation methods, or by revisiting the data with neural networks or genetic algorithms.
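As a starting point for the cross-validation idea, here is a hedged sketch (toy data generated with `make_classification` stands in for the real `X, y`): stratified folds preserve the rare-fraud class ratio in every split, and any resampling such as SMOTE/ADASYN should then be applied inside each training fold only, never to the validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced data (~2% positives) standing in for the real set.
X, y = make_classification(n_samples=2000, weights=[0.98], random_state=0)

# Stratified folds keep the class ratio identical across all five splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="average_precision")
print(scores.mean())  # mean PR-style score across folds
```

Averaging a precision-recall-based score across folds gives a far more stable estimate than the single train/test split used above.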